I have recently started DeepLearning.AI's courses. Here are my notes from the second week (Neural Networks Basics) of the first course. I hope you find them helpful.
Apologies for the pdf format. I am having a hard time getting the LaTeX to render on my site. I hope to fix this soon.
Given $x$, where $x \in {\rm I\!R}^{n_{x}}$,
we want $\hat{y} = P(y=1|x)$.
Parameters: $w \in {\rm I\!R}^{n_{x}}$, $b \in {\rm I\!R}$
Output option 1 (linear regression): $ \hat{y} = w^T x + b $
Output option 2 (logistic regression): $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$
In [1]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# Create z and sigma(z)
z = np.linspace(-5, 5)
sigma = 1 / (1 + np.exp(-z))
# Draw the prediction cut-off line at 0.5
plt.axhline(0.5, color='black', ls='--')
# Label the axes
plt.xlabel('z')
plt.ylabel(r'$\hat{y}$')
# Hide the x-axis ticks and plot the sigmoid curve
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.plot(z, sigma, '-', lw=3);
Loss Function:
For an individual instance, the loss function is:
$L(\hat{y},y) = -(y\log\hat{y} + (1-y)\log(1-\hat{y}))$
Intuition:
If $y=1$, the loss reduces to $-\log\hat{y}$, so minimizing it pushes $\hat{y}$ towards 1. If $y=0$, the loss reduces to $-\log(1-\hat{y})$, so minimizing it pushes $\hat{y}$ towards 0.
Cost Function:
Across training set, the cost function is:
$J(w,b)=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})]$
$J(w,b)$ is a convex function, so gradient descent will not get stuck in a local minimum.
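To make these formulas concrete, here is a minimal numpy sketch; the labels y and predictions y_hat are made-up values, not from the course:
In [ ]:
import numpy as np
# Hypothetical labels and predictions for m = 4 training examples
y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])
# Per-example loss: L(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat))
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
# Cost J: the average of the per-example losses over the training set
cost = np.mean(loss)
print(loss)   # confident, correct predictions give small losses
print(cost)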
Gradient Descent Algorithm:
For cost function J(w,b) and learning rate $\alpha$,
repeat {
$w:=w-\alpha\frac{\partial J(w,b)}{\partial w}$
$b:=b-\alpha\frac{\partial J(w,b)}{\partial b}$
}
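As a sketch in code, the loop looks like this; compute_gradients is a hypothetical stand-in for the backward pass derived further down:
In [ ]:
def gradient_descent(w, b, compute_gradients, alpha=0.01, num_iterations=1000):
    # Repeatedly apply the update rule for a fixed number of iterations
    for _ in range(num_iterations):
        dw, db = compute_gradients(w, b)  # hypothetical backward pass returning (dw, db)
        w = w - alpha * dw
        b = b - alpha * db
    return w, b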
Capital letters indicate a matrix rather than a single training instance.
Vectorization is the art of removing explicit for loops. In numpy, for loops are much slower than the equivalent vectorized matrix operations.
To compute the product of $W$ transpose and $X$, use numpy's np.dot(X, Y) function and the .T attribute for the transpose:
$z = np.dot(W.T,X) + b$
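A quick sketch comparing the two approaches (the sizes n_x and m below are arbitrary) shows that the vectorized version gives the same result without the explicit loop:
In [ ]:
import numpy as np

n_x, m = 100, 1000                     # arbitrary feature count and number of examples
W = np.random.randn(n_x, 1)
X = np.random.randn(n_x, m)
b = 0.5

# Explicit for loop: one dot product per training example
z_loop = np.zeros((1, m))
for i in range(m):
    z_loop[0, i] = np.dot(W[:, 0], X[:, i]) + b

# Vectorized: one matrix product for the whole training set
z_vec = np.dot(W.T, X) + b

print(np.allclose(z_loop, z_vec))      # True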
In code, we represent the partial derivatives as follows:
$dw=\frac{\partial J(w,b)}{\partial w}$
$db=\frac{\partial J(w,b)}{\partial b}$
After taking partial derivatives through the computation graph, we find that the change in loss with respect to $z$ for a single training example is:
$\frac{\partial L}{\partial z^{(i)}} = \hat{y}^{(i)}-y^{(i)}$
Stacked across the whole training set, this is the quantity we represent in code as $dZ$.
Calculate Z (capital, since it is computed for the whole training set at once):
$Z = np.dot(W.T,X)+b$
Calculate A (apply the sigmoid function element-wise to squash Z into the (0,1) range):
$A= \sigma(Z)$
Calculate dZ (the change in cost with respect to Z):
$dZ = A-Y$
Calculate dw and db (the gradients of the weights and bias):
$dw = \frac{1}{m}np.dot(X,dZ.T)$
$db = \frac{1}{m}np.sum(dZ)$
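Putting the forward and backward passes together, here is a minimal sketch of vectorized logistic regression trained with gradient descent; the data X and Y, the learning rate, and the iteration count are made-up placeholders:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up data: n_x features, m training examples, binary labels
n_x, m = 10, 500
X = np.random.randn(n_x, m)
Y = (np.random.rand(1, m) > 0.5).astype(float)

# Initialise parameters
W = np.zeros((n_x, 1))
b = 0.0
alpha = 0.1

for _ in range(1000):
    # Forward pass
    Z = np.dot(W.T, X) + b          # shape (1, m)
    A = sigmoid(Z)                  # shape (1, m)

    # Backward pass (vectorized gradients)
    dZ = A - Y                      # shape (1, m)
    dW = (1 / m) * np.dot(X, dZ.T)  # shape (n_x, 1)
    db = (1 / m) * np.sum(dZ)       # scalar

    # Gradient descent update
    W = W - alpha * dW
    b = b - alpha * db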
In [2]:
# Create a dummy RGB image; here n and m are the image height and width, not training-set sizes
n=64
m=100
img = np.random.randn(n,m,3)
print('Shape of standard image: {}'.format(img.shape))
In [3]:
# Flatten the image into a single feature column vector for training or classification
reshaped_img = img.reshape((img.shape[0]*img.shape[1]*3,1))
print('Shape of reshaped image: {}'.format(reshaped_img.shape))
Broadcasting refers to numpy's automatic expansion of array shapes so that element-wise operations can be applied to arrays of different sizes.
Given an array of shape (n,m), adding/subtracting/multiplying/dividing by an array or a real number of a different shape expands the smaller operand as follows: a (1,m) array is copied down the rows to become (n,m); an (n,1) array is copied across the columns to become (n,m); and a real number is treated as an (n,m) array filled with that value.
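A small sketch of these rules in action (the arrays below are arbitrary examples):
In [ ]:
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])   # shape (1, 3): broadcast down the rows
col = np.array([[100.0], [200.0]])     # shape (2, 1): broadcast across the columns

print(A + row)   # row is treated as if copied to shape (2, 3)
print(A / col)   # col is treated as if copied to shape (2, 3)
print(A * 2)     # the scalar 2 is applied to every element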
Avoid using rank 1 arrays:
These arrays have a shape of (n,). Use reshape to give them a shape of (n,1) or (1,n) to avoid tricky bugs in code.
E.g. use:
X = np.zeros((5,1))
Instead of:
X = np.zeros(5)
Use assert to check array shape:
assert(X.shape==(5,1))
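A short sketch of why rank 1 arrays cause trouble: with shape (5,), the transpose does nothing and what looks like an outer product silently collapses to a single number, while a (5,1) column vector behaves as expected:
In [ ]:
import numpy as np

a = np.random.randn(5)        # rank 1 array, shape (5,)
print(a.shape, a.T.shape)     # (5,) (5,) -- the transpose has no effect
print(np.dot(a, a.T))         # a single number, not a 5x5 matrix

b = np.random.randn(5, 1)     # proper column vector, shape (5, 1)
print(b.shape, b.T.shape)     # (5, 1) (1, 5)
print(np.dot(b, b.T).shape)   # (5, 5): the outer product we expect

assert b.shape == (5, 1)      # catch shape mistakes early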